R + Quarto: How we developed a pipeline to create >3500 HTML factsheets

Gabe Morrison

Introduction and Background

Background: Who am I

  1. MS in Computational Analysis and Public Policy and BA in Geographical Sciences from the University of Chicago
  2. Data Scientist at the Urban Institute

Spatial Equity Data Tool

Synthetic Data

Mobility Metrics

What Are the Mobility Metrics:

For each predictor, there are one or two metrics.

Problem Statement:

  • We want a way to display the information we’ve collected for every county and “large” city
    • That is \(3183 + 486 = 3669\) factsheets!
  • We may need to update these factsheets later, so we want a repeatable pipeline rather than a one-off script
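The core of such a pipeline is Quarto’s parameterized reports: one template document, rendered once per geography. A minimal sketch using quarto::quarto_render() and purrr, where the template name factsheet.qmd, its geoid/name parameters, and the example geographies are all hypothetical:

```r
library(quarto)
library(purrr)

# Hypothetical lookup table: one row per county/city factsheet
geographies <- data.frame(
  geoid = c("17031", "06037"),
  name  = c("Cook County, IL", "Los Angeles County, CA")
)

# Render the same template once per row, passing each row's
# columns to the template as Quarto parameters
pwalk(geographies, function(geoid, name) {
  quarto_render(
    input          = "factsheet.qmd",
    output_file    = paste0("factsheet_", geoid, ".html"),
    execute_params = list(geoid = geoid, name = name)
  )
})
```

Scaling this loop from two rows to all 3669 geographies is where parallelization (furrr) and a large EC2 instance come in.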

What do these factsheets look like?

They look like this

Technological Background:

purrr

map(.x, .f, ..., .progress = FALSE)

purrr example

library(tidyverse)


# Divide b by a
example_function <- function(a, b) {
  return(b / a)
}
  
df <- data.frame(a = 1:5, b = 11:15)
df
  a  b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15

purrr example

pmap() iterates row-wise, matching the data frame’s columns to the function’s arguments by name:

pmap(.l = df, .f = example_function)
[[1]]
[1] 11

[[2]]
[1] 6

[[3]]
[1] 4.333333

[[4]]
[1] 3.5

[[5]]
[1] 3

furrr

  • Like purrr but parallelizes across cores
  • Let’s look at a nice example from their documentation
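As a minimal sketch, reusing example_function and df from the purrr example above, future_pmap() is a drop-in parallel replacement for pmap() once a plan is set:

```r
library(future)
library(furrr)

# Run work in parallel background R sessions, one per core
plan(multisession)

example_function <- function(a, b) {
  return(b / a)
}
df <- data.frame(a = 1:5, b = 11:15)

# Same row-wise mapping as pmap(), spread across cores
future_pmap(.l = df, .f = example_function)
```

For a job this small the session startup overhead outweighs the speedup; the payoff comes when each call renders a full factsheet.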

AWS:

EC2 Ideas:

  • Elastic Compute Cloud (EC2) gives a lot of options
    • Pricing: c6a.32xlarge = $4.82/hour
  • Cloud computing is cheap relative to human labor:
    • No AWS: \(5\:hours * \$50/hour = \$250\)
    • AWS: \(4\:hours * \$55/hour = \$220\) (labor plus roughly \$5/hour of compute)
  • R on AWS: an R-specific Docker image configured to run on spin-up

Let’s look at some code!

Running in the Cloud

  1. Spin up large EC2 instance (c6a.32xlarge)
  2. ssh into the EC2 instance
  3. Clone repo
  4. Further configure instance:
    1. Update quarto
    2. Update packages and get folders set up
  5. Call render_standard_pages.R
  6. Copy to S3 with bash commands
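At the shell, the steps above might look like the following; the key path, host, repo URL, folder names, and bucket are all placeholders, and step 4’s configuration is elided:

```shell
# 2. ssh into the EC2 instance (placeholder key and host)
ssh -i my-key.pem ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# 3. Clone the repo (placeholder URL)
git clone https://github.com/example/factsheets.git && cd factsheets

# 4. ... update Quarto, install R packages, create output folders ...

# 5. Render all factsheets
Rscript render_standard_pages.R

# 6. Copy the rendered HTML to S3 (placeholder paths)
aws s3 sync output/ s3://example-bucket/factsheets/
```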

Key Takeaway: